The Performance Paradox states that even a mathematically perfect kernel (e.g., $out = x + y$) can end up slower than a CPU loop if it fails to amortize the GPU's fixed hardware costs. This typically shows up as the launch tax.
1. The Myth of "Correctness"
Functional correctness is not the same as efficiency. Your Triton code may correctly distribute work across thousands of threads, but if the total workload (N) is too small, the GPU remains underutilized: the hardware spends far more time on state transitions than on actual computation.
2. The Python Measurement Trap
Benchmarking GPU code with Python's time.time() is risky. GPU calls are asynchronous; Python merely enqueues the command and moves on. Without torch.cuda.synchronize(), you are measuring enqueue time. With synchronization added, you are measuring host-to-device latency, which is often as much as ten times longer than the kernel's execution time.
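The pitfall can be demonstrated without a GPU. The sketch below uses a hypothetical `FakeDevice` class (a toy stand-in built on threads, not any real CUDA API) to mimic asynchronous launches: submitting work returns immediately, and only `synchronize()` blocks until it finishes. Real code would call `torch.cuda.synchronize()` instead.

```python
import threading
import time

class FakeDevice:
    """Toy stand-in for an asynchronous accelerator: launching work
    returns immediately; synchronize() blocks until the queue drains."""
    def __init__(self):
        self._threads = []

    def launch(self, seconds):
        # Enqueue the "kernel" and return right away, like a CUDA launch.
        t = threading.Thread(target=time.sleep, args=(seconds,))
        t.start()
        self._threads.append(t)

    def synchronize(self):
        # Block the host until all queued work has completed.
        for t in self._threads:
            t.join()
        self._threads.clear()

dev = FakeDevice()

# Naive timing: the clock stops after enqueueing, not after execution.
t0 = time.perf_counter()
dev.launch(0.05)            # pretend this kernel runs for 50 ms
naive = time.perf_counter() - t0

# Correct timing: wait for the device to finish before stopping the clock.
t0 = time.perf_counter()
dev.launch(0.05)
dev.synchronize()
synced = time.perf_counter() - t0

print(f"naive:  {naive * 1e3:.2f} ms")   # far less than 50 ms
print(f"synced: {synced * 1e3:.2f} ms")  # at least ~50 ms
```

The naive number looks impressively small because it only measures how long the host took to hand off the command; the synchronized number reflects the work actually done.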
3. Latency vs. Throughput
To overcome this paradox, you must supply enough work to hide the launch latency. This is precisely the shift from a latency-bound regime (dominated by the CPU-GPU interface) to a throughput-bound regime (limited by GPU memory or compute capacity).
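This shift can be made concrete with a back-of-the-envelope model. The numbers below are illustrative assumptions (a ~40 µs fixed launch cost and 1.5 TB/s of memory bandwidth), not measurements from any particular GPU; the point is only how the fixed cost shrinks relative to useful work as N grows.

```python
# Assumed figures for illustration only; real values vary by GPU and driver.
LAUNCH_OVERHEAD_US = 40.0
BANDWIDTH_BYTES_PER_US = 1.5e12 / 1e6  # 1.5 TB/s expressed in bytes per µs

def vector_add_time_us(n, dtype_bytes=4):
    """Model total time for out = x + y: a fixed launch cost plus the time
    to stream 3 arrays (2 loads + 1 store) through memory. Arithmetic
    intensity is 1 FLOP per 12 bytes moved, so memory traffic, not math,
    sets the kernel's duration."""
    traffic = 3 * n * dtype_bytes
    return LAUNCH_OVERHEAD_US + traffic / BANDWIDTH_BYTES_PER_US

for n in (256, 10**8):
    total = vector_add_time_us(n)
    overhead_pct = 100.0 * LAUNCH_OVERHEAD_US / total
    print(f"N={n:>9}: total ≈ {total:8.1f} µs, launch tax ≈ {overhead_pct:5.1f}%")
```

At N=256 the launch tax is essentially the entire runtime (latency-bound); at N=10^8 it falls to a few percent and memory bandwidth dominates (throughput-bound).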
QUESTION 1
For each kernel, decide whether the bottleneck is likely arithmetic throughput, memory bandwidth, or launch overhead: Vector addition (N=256), Vector addition (N=10^8), and Matrix Multiplication (N=8192).
N=256: Arithmetic; N=10^8: Bandwidth; MM: Launch
N=256: Launch; N=10^8: Bandwidth; MM: Arithmetic
N=256: Bandwidth; N=10^8: Arithmetic; MM: Launch
All are compute-bound.
✅ Correct!
At very small N, launch overhead dominates. Large vector adds are memory-bandwidth limited. Dense matrix multiplications have high arithmetic intensity and become compute-bound.
❌ Incorrect
Think about the ratio of math to data movement, and the constant cost of starting a kernel.
QUESTION 2
In the context of the Performance Paradox, what is the primary bottleneck for a 'ReLU on a matrix' operation?
Arithmetic Throughput
Memory Bandwidth
Register Pressure
L1 Cache Size
✅ Correct!
ReLU is memory-bound. It performs one very simple comparison (max(0, x)) for every load and store, resulting in extremely low arithmetic intensity.
❌ Incorrect
Does ReLU perform complex math, or does it spend most of its time moving data to and from HBM?
QUESTION 3
What does the term 'Asynchronous Execution' imply regarding GPU benchmarking?
The GPU and CPU always finish at the same time.
The CPU continues to the next line of code before the GPU kernel finishes.
The kernel runs faster on smaller GPUs.
Memory transfers are blocked by compute.
✅ Correct!
This is why synchronization is required for accurate timing; otherwise, you just time how long it took to send the command.
❌ Incorrect
If the CPU waited for every GPU call, performance would be significantly worse due to constant idle cycles.
QUESTION 4
Why does $out = x + y$ exhibit low arithmetic intensity?
It uses three memory accesses (2 loads, 1 store) for a single floating-point operation.
The addition operation is too complex for the ALUs.
It requires shared memory synchronization.
It only runs on one SM.
✅ Correct!
High-performance compute requires many FLOPs per byte moved. Vector add is the opposite, making it bandwidth-limited.
❌ Incorrect
Count the number of times you access memory (tl.load/tl.store) versus the number of math operations (+).
QUESTION 5
How can the 'Launch Tax' be amortized in a real-world application?
By calling the kernel more frequently with smaller data.
By increasing the workload per launch (e.g., larger N or batching).
By using 16-bit floats instead of 32-bit floats.
By disabling the L2 cache.
✅ Correct!
Increasing the workload makes the fixed overhead a smaller percentage of the total execution time.
❌ Incorrect
Smaller data sizes actually make the launch tax more prominent relative to the useful work.
Case Study: The Overhead Audit
Interpreting Host vs. Device Benchmarks
A developer runs a Triton kernel for Vector Addition on 512 elements. They measure 45 microseconds using Python's `time.time()`. When profiling the same kernel using NVIDIA Nsight Systems, the actual GPU duration is reported as only 2.1 microseconds.
Q
1. What is the approximate 'Launch Tax' in microseconds for this scenario, and what percentage of the total measured time does it represent?
Solution:
The Launch Tax is approximately 42.9 microseconds (45 µs total − 2.1 µs of GPU work). This represents ~95.3% of the total measured time, indicating the application is heavily bound by system overhead rather than computation.
Q
2. If the developer increases N to 1,000,000 elements, assuming the kernel now takes 150 microseconds on the GPU, how does the Launch Tax impact the overall efficiency?
Solution:
With a constant launch overhead of ~43 µs, the total time would be ~193 µs. The overhead now accounts for only ~22.3% of the time. Efficiency improves as N increases because the fixed cost is spread over a much larger volume of compute/memory work.
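Both answers follow from the same subtraction; a few lines of Python make the arithmetic explicit (the 150 µs kernel time for the second scenario is the figure given in the question, not a measurement):

```python
def launch_tax(total_us, kernel_us):
    """Split a host-side measurement into GPU work and fixed overhead,
    returning the overhead and its share of the total."""
    overhead = total_us - kernel_us
    return overhead, 100.0 * overhead / total_us

# Scenario 1: N = 512; host timer reads 45 µs, Nsight reports 2.1 µs on GPU.
overhead, pct = launch_tax(45.0, 2.1)
print(f"launch tax ≈ {overhead:.1f} µs ({pct:.1f}% of total)")  # ≈ 42.9 µs, 95.3%

# Scenario 2: N = 1,000,000; same overhead, but the kernel now takes 150 µs.
total = overhead + 150.0
print(f"overhead share ≈ {100.0 * overhead / total:.1f}% of {total:.1f} µs")
```

The fixed cost is identical in both scenarios; only its share of the total changes, which is the entire content of the amortization argument.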